---
title: "Binary Classification Model for Caravan Insurance Marketing Using R Take 2"
author: "David Lowe"
date: "December 24, 2018"
output:
  html_document:
    toc: yes
---

Template Credit: Adapted from a template made available by Dr. Jason Brownlee of Machine Learning Mastery. [https://machinelearningmastery.com/]

SUMMARY: The purpose of this project is to construct a prediction model using various machine learning algorithms and to document the end-to-end steps using a template. The Insurance Company Benchmark dataset is a classic binary classification situation where we are trying to predict one of the two possible outcomes.

INTRODUCTION: This data set, used in the CoIL 2000 Challenge, contains information on customers of an insurance company. The data consist of 86 variables and include product usage data and socio-demographic data derived from zip codes.

The data was supplied by the Dutch data mining company Sentient Machine Research and is based on a real-world business problem. The training set contains over 5000 descriptions of customers, including the information of whether they have a caravan insurance policy. A test dataset contains another 4000 customers whose information will be used to test the effectiveness of the machine learning models.

The insurance organization collected the data to answer the following question: Can we predict who would be interested in buying a caravan insurance policy and give an explanation why?

In iteration Take1, we had algorithms with high accuracy but with strong biases due to the imbalance of our dataset. For this iteration, we will examine the feasibility of using the SMOTE technique to balance the dataset.
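
As a rough illustration of what SMOTE does, the sketch below generates one synthetic minority case by interpolating between a minority row and one of its nearest minority neighbours. It uses base R only. The toy matrix `minority` and the helper `smote_point` are hypothetical names introduced for this example, and the `SMOTE()` function used later in this script does more than this (for instance, it also under-samples the majority class).

```r
# Toy minority-class data (hypothetical values, numeric attributes only)
minority <- matrix(c(1, 2,
                     2, 3,
                     3, 1,
                     8, 9), ncol = 2, byrow = TRUE)

# Create one synthetic case from row i using its k nearest minority neighbours
smote_point <- function(data, i, k = 2) {
  # Euclidean distance from row i to every row
  centre <- matrix(data[i, ], nrow(data), ncol(data), byrow = TRUE)
  d <- sqrt(rowSums((data - centre)^2))
  d[i] <- Inf                              # exclude the case itself
  neighbours <- order(d)[seq_len(k)]       # indices of the k nearest rows
  j <- neighbours[sample.int(k, 1)]        # pick one neighbour at random
  gap <- runif(1)                          # interpolation weight between 0 and 1
  data[i, ] + gap * (data[j, ] - data[i, ])
}

set.seed(888)
synthetic <- smote_point(minority, i = 1)
print(synthetic)
```

The synthetic point always lies on the line segment between the chosen case and its neighbour, so the outlying row (8, 9) does not influence it when `k = 2`.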

ANALYSIS: In the Take1 iteration, the baseline performance of the seven algorithms achieved an average ROC score of 0.6965. Two algorithms, Decision Tree and Random Forest, achieved the top two ROC scores after the first round of modeling. After a series of tuning trials, Random Forest yielded the top result using the training data. It achieved a ROC score of 0.7159. After using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with a ROC score of 0.5285, which was significantly below the result from the training data.

In the current iteration, the baseline performance of the seven algorithms achieved an average ROC score of 0.9013. Two algorithms, Random Forest and Stochastic Gradient Boosting, achieved the top two ROC scores after the first round of modeling. After a series of tuning trials, Random Forest yielded the top result using the training data. It achieved a ROC score of 0.9243. After using the optimized tuning parameters, the Random Forest algorithm processed the validation dataset with a ROC score of 0.5746, which was significantly below the result from the training data.

CONCLUSION: For this iteration, the SMOTE technique reduced the class imbalance in the training dataset, but the final validation metric improved only slightly (a ROC score of 0.5746 versus 0.5285 in Take1). Overall, the Random Forest algorithm achieved the leading ROC scores using the training dataset, but the model failed to perform adequately using the validation dataset. For this dataset, Random Forest should still be considered for further modeling and testing before making it available for production use.

Dataset Used: Insurance Company Benchmark (COIL 2000) Data Set

Dataset ML Model: Binary classification with numerical and categorical attributes

Dataset Reference: https://archive.ics.uci.edu/ml/datasets/Insurance+Company+Benchmark+(COIL+2000)

One potential source of performance benchmark: https://www.kaggle.com/uciml/caravan-insurance-challenge

The project aims to touch on the following areas:

  1. Document a predictive modeling problem end-to-end.
  2. Explore data cleaning and transformation options
  3. Explore non-ensemble and ensemble algorithms for baseline model performance
  4. Explore algorithm tuning techniques for improving model performance

Any predictive modeling machine learning project can generally be broken down into about six major tasks:

  1. Prepare Problem
  2. Summarize Data
  3. Prepare Data
  4. Model and Evaluate Algorithms
  5. Improve Accuracy or Results
  6. Finalize Model and Present Results

1. Prepare Problem

1.a) Load libraries

startTimeScript <- proc.time()
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(corrplot)
## corrplot 0.84 loaded
library(mailR)
library(parallel)
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(stringr)
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
library(DMwR)
## Loading required package: grid
# Create one random seed number for reproducible results
seedNum <- 888
set.seed(seedNum)

1.b) Set up the email notification function

email_notify <- function(msg=""){
  sender <- "luozhi2488@gmail.com"
  receiver <- "dave@contactdavidlowe.com"
  sbj_line <- "Notification from R Script"
  password <- readLines("../email_credential.txt")
  send.mail(
    from = sender,
    to = receiver,
    subject= sbj_line,
    body = msg,
    smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = sender, passwd = password, ssl = TRUE),
    authenticate = TRUE,
    send = TRUE)
}
email_notify(paste("Library and Data Loading has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@47fd17e3}"

1.c) Load dataset

# Read the list of attribute names from a file
attrFile = "TicAttributes.txt"
conn <- file(attrFile, open="r")
lines <- readLines(conn)
close(conn)
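# Keep the first word of each line as the column name
# seq_along() avoids the 1:0 pitfall if the file turns out to be empty
colNames <- c()
for (i in seq_along(lines)) {
  colNames <- c(colNames, word(lines[i]))
}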
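
The loop above keeps the first word of each line of the attribute file. A minimal sketch of the same idea on made-up input is shown below. The three sample lines are hypothetical and only mimic a "name followed by description" layout, which is an assumption about `TicAttributes.txt`. Base R is used so the snippet runs without stringr.

```r
# Hypothetical lines in a "NAME description" layout
sample_lines <- c("MOSTYPE Customer subtype",
                  "MAANTHUI Number of houses",
                  "CARAVAN Number of mobile home policies")

# Keep everything before the first space on each line (vectorized, no loop)
first_words <- sub(" .*$", "", sample_lines)
print(first_words)
```

With stringr loaded, `word(lines)` is vectorized over its input and should give the same result without an explicit loop.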

# Import the records for the training dataset
inputFile = "ticdata2000.txt"
xy_train <- read.csv(inputFile, header = FALSE, sep = "\t", col.names = colNames)

# Standardize the class column to the name of targetVar
xy_train$targetVar <- "Yes"
xy_train$targetVar[xy_train$CARAVAN==0] <- "No"
xy_train$targetVar <- as.factor(xy_train$targetVar)
xy_train$targetVar <- relevel(xy_train$targetVar, "Yes")
xy_train$CARAVAN <- NULL
cat("Number of training rows and columns imported into xy_train:", nrow(xy_train), "by", ncol(xy_train), "\n")
## Number of training rows and columns imported into xy_train: 5822 by 86
# Import the records for the test/eval dataset without the target variable
noTargetCol <- colNames[-length(colNames)]
inputFile = "ticeval2000.txt"
x_test <- read.csv(inputFile, header = FALSE, sep = "\t", col.names = noTargetCol)
cat("Number of test rows and columns imported into x_test:", nrow(x_test), "by", ncol(x_test), "\n")
## Number of test rows and columns imported into x_test: 4000 by 85
# Import the records for the test/eval dataset with only the target variable
inputFile = "tictgts2000.txt"
y_test <- read.csv(inputFile, header = FALSE, col.names = c("CARAVAN"))
y_test$targetVar <- "Yes"
y_test$targetVar[y_test$CARAVAN==0] <- "No"
y_test$targetVar <- as.factor(y_test$targetVar)
y_test$targetVar <- relevel(y_test$targetVar, "Yes")
y_test$CARAVAN <- NULL
cat("Number of test rows and columns imported into y_test:", nrow(y_test), "by", ncol(y_test), "\n")
## Number of test rows and columns imported into y_test: 4000 by 1
# Use variable totCol to hold the number of columns in the dataframe
totCol <- ncol(xy_train)

# Set up variable totAttr for the total number of attribute columns
totAttr <- totCol-1
# targetCol variable indicates the column location of the target/class variable
# If the first column, set targetCol to 1. If the last column, set targetCol to totCol
# if (targetCol <> 1) and (targetCol <> totCol), be aware when slicing up the dataframes for visualization! 
targetCol <- totCol
# We create training datasets (xy_train, x_train, y_train) for various operations.
# We create validation datasets (xy_test, x_test, y_test) for various operations.
set.seed(seedNum)

# Create a list of the rows in the original dataset we can use for training
# training_index <- createDataPartition(originalDataset$targetVar, p=0.70, list=FALSE)
# Use 70% of the data to train the models and the remaining for testing/validation
# xy_train <- originalDataset[training_index,]
# xy_test <- originalDataset[-training_index,]

if (targetCol==1) {
  x_train <- xy_train[,(targetCol+1):totCol]
  y_train <- xy_train[,targetCol]
  xy_test <- cbind(y_test, x_test)
  y_test <- xy_test[,targetCol]
} else {
  x_train <- xy_train[,1:(totAttr)]
  y_train <- xy_train[,totCol]
  xy_test <- cbind(x_test, y_test)
  y_test <- xy_test[,targetCol]
}
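
The recoding of `CARAVAN` into a `Yes`/`No` factor with `Yes` as the first level can also be written in one step. The sketch below uses a small made-up vector rather than the real data, and `caravan` and `target` are illustrative names.

```r
# Hypothetical 0/1 class values
caravan <- c(0, 1, 0, 0, 1)

# Map 1 -> "Yes" and 0 -> "No", listing "Yes" first so it becomes the first level
target <- factor(caravan, levels = c(1, 0), labels = c("Yes", "No"))

print(levels(target))
print(table(target))
```

One difference from the code above is worth noting: here any value other than 0 or 1 becomes `NA`, whereas the original recoding labels every non-zero value as `Yes`. Putting `Yes` first matters because caret's `twoClassSummary` treats the first factor level as the event of interest.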

1.d) Set up the key parameters to be used in the script

# Set up the number of rows and columns for visualization display. dispRow * dispCol should be >= totAttr
dispCol <- 5
if (totAttr%%dispCol == 0) {
  dispRow <- totAttr%/%dispCol
} else {
  dispRow <- (totAttr%/%dispCol) + 1
}
cat("Will attempt to create graphics grid (col x row): ", dispCol, ' by ', dispRow)
## Will attempt to create graphics grid (col x row):  5  by  17
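
The row count for the plotting grid is a ceiling division. A compact equivalent, using the 85 attribute columns and 5 plots per row from this script, might look like this:

```r
totAttr <- 85   # number of attribute columns (86 columns minus the target)
dispCol <- 5    # plots per row

# Round up so that dispRow * dispCol >= totAttr
dispRow <- ceiling(totAttr / dispCol)
print(dispRow)
```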

1.e) Set test options and evaluation metric

# Run algorithms using 10-fold cross validation
control <- trainControl(method="repeatedcv", number=10, repeats=1, classProbs=TRUE, savePredictions=TRUE, summaryFunction=twoClassSummary)
metricTarget <- "ROC"
email_notify(paste("Library and Data Loading completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@34340fab}"
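
The `ROC` metric returned by `twoClassSummary` is the area under the ROC curve. To make the number concrete, the sketch below computes that area by hand from class probabilities using the rank (Mann-Whitney) formulation. The scores and labels are invented, and `auc_rank` is a hypothetical helper, not part of caret.

```r
# Area under the ROC curve via the rank-sum identity
auc_rank <- function(scores, is_positive) {
  r <- rank(scores)                      # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Invented predicted probabilities of "Yes" and the true classes
scores <- c(0.9, 0.8, 0.4, 0.35, 0.3, 0.1)
truth  <- c("Yes", "No", "Yes", "No", "No", "No")

auc <- auc_rank(scores, truth == "Yes")
print(auc)
```

The result is the share of (Yes, No) pairs in which the `Yes` case receives the higher score. A value of 0.5 corresponds to random ranking, which puts the validation scores of 0.5285 and 0.5746 quoted in the summary into context.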

2. Summarize Data

To gain a better understanding of the data that we have on-hand, we will leverage a number of descriptive statistics and data visualization techniques. The plan is to use the results to consider new questions, review assumptions, and validate hypotheses that we can investigate later with specialized models.

email_notify(paste("Data Summarization and Visualization has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@546a03af}"

2.a) Descriptive statistics

2.a.i) Peek at the data itself.

head(xy_train)
##   MOSTYPE MAANTHUI MGEMOMV MGEMLEEF MOSHOOFD MGODRK MGODPR MGODOV MGODGE
## 1      33        1       3        2        8      0      5      1      3
## 2      37        1       2        2        8      1      4      1      4
## 3      37        1       2        2        8      0      4      2      4
## 4       9        1       3        3        3      2      3      2      4
## 5      40        1       4        2       10      1      4      1      4
## 6      23        1       2        1        5      0      5      0      5
##   MRELGE MRELSA MRELOV MFALLEEN MFGEKIND MFWEKIND MOPLHOOG MOPLMIDD
## 1      7      0      2        1        2        6        1        2
## 2      6      2      2        0        4        5        0        5
## 3      3      2      4        4        4        2        0        5
## 4      5      2      2        2        3        4        3        4
## 5      7      1      2        2        4        4        5        4
## 6      0      6      3        3        5        2        0        5
##   MOPLLAAG MBERHOOG MBERZELF MBERBOER MBERMIDD MBERARBG MBERARBO MSKA
## 1        7        1        0        1        2        5        2    1
## 2        4        0        0        0        5        0        4    0
## 3        4        0        0        0        7        0        2    0
## 4        2        4        0        0        3        1        2    3
## 5        0        0        5        4        0        0        0    9
## 6        4        2        0        0        4        2        2    2
##   MSKB1 MSKB2 MSKC MSKD MHHUUR MHKOOP MAUT1 MAUT2 MAUT0 MZFONDS MZPART
## 1     1     2    6    1      1      8     8     0     1       8      1
## 2     2     3    5    0      2      7     7     1     2       6      3
## 3     5     0    4    0      7      2     7     0     2       9      0
## 4     2     1    4    0      5      4     9     0     0       7      2
## 5     0     0    0    0      4      5     6     2     1       5      4
## 6     2     2    4    2      9      0     5     3     3       9      0
##   MINKM30 MINK3045 MINK4575 MINK7512 MINK123M MINKGEM MKOOPKLA PWAPART
## 1       0        4        5        0        0       4        3       0
## 2       2        0        5        2        0       5        4       2
## 3       4        5        0        0        0       3        4       2
## 4       1        5        3        0        0       4        4       0
## 5       0        0        9        0        0       6        3       0
## 6       5        2        3        0        0       3        3       0
##   PWABEDR PWALAND PPERSAUT PBESAUT PMOTSCO PVRAAUT PAANHANG PTRACTOR
## 1       0       0        6       0       0       0        0        0
## 2       0       0        0       0       0       0        0        0
## 3       0       0        6       0       0       0        0        0
## 4       0       0        6       0       0       0        0        0
## 5       0       0        0       0       0       0        0        0
## 6       0       0        6       0       0       0        0        0
##   PWERKT PBROM PLEVEN PPERSONG PGEZONG PWAOREG PBRAND PZEILPL PPLEZIER
## 1      0     0      0        0       0       0      5       0        0
## 2      0     0      0        0       0       0      2       0        0
## 3      0     0      0        0       0       0      2       0        0
## 4      0     0      0        0       0       0      2       0        0
## 5      0     0      0        0       0       0      6       0        0
## 6      0     0      0        0       0       0      0       0        0
##   PFIETS PINBOED PBYSTAND AWAPART AWABEDR AWALAND APERSAUT ABESAUT AMOTSCO
## 1      0       0        0       0       0       0        1       0       0
## 2      0       0        0       2       0       0        0       0       0
## 3      0       0        0       1       0       0        1       0       0
## 4      0       0        0       0       0       0        1       0       0
## 5      0       0        0       0       0       0        0       0       0
## 6      0       0        0       0       0       0        1       0       0
##   AVRAAUT AAANHANG ATRACTOR AWERKT ABROM ALEVEN APERSONG AGEZONG AWAOREG
## 1       0        0        0      0     0      0        0       0       0
## 2       0        0        0      0     0      0        0       0       0
## 3       0        0        0      0     0      0        0       0       0
## 4       0        0        0      0     0      0        0       0       0
## 5       0        0        0      0     0      0        0       0       0
## 6       0        0        0      0     0      0        0       0       0
##   ABRAND AZEILPL APLEZIER AFIETS AINBOED ABYSTAND targetVar
## 1      1       0        0      0       0        0        No
## 2      1       0        0      0       0        0        No
## 3      1       0        0      0       0        0        No
## 4      1       0        0      0       0        0        No
## 5      1       0        0      0       0        0        No
## 6      0       0        0      0       0        0        No

2.a.ii) Dimensions of the dataset.

dim(xy_train)
## [1] 5822   86

2.a.iii) Types of the attributes.

sapply(xy_train, class)
##   MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##    MGODOV    MGODGE    MRELGE    MRELSA    MRELOV  MFALLEEN  MFGEKIND 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MFWEKIND  MOPLHOOG  MOPLMIDD  MOPLLAAG  MBERHOOG  MBERZELF  MBERBOER 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MBERMIDD  MBERARBG  MBERARBO      MSKA     MSKB1     MSKB2      MSKC 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##      MSKD    MHHUUR    MHKOOP     MAUT1     MAUT2     MAUT0   MZFONDS 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##    MZPART   MINKM30  MINK3045  MINK4575  MINK7512  MINK123M   MINKGEM 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  MKOOPKLA   PWAPART   PWABEDR   PWALAND  PPERSAUT   PBESAUT   PMOTSCO 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   PVRAAUT  PAANHANG  PTRACTOR    PWERKT     PBROM    PLEVEN  PPERSONG 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   PGEZONG   PWAOREG    PBRAND   PZEILPL  PPLEZIER    PFIETS   PINBOED 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  PBYSTAND   AWAPART   AWABEDR   AWALAND  APERSAUT   ABESAUT   AMOTSCO 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   AVRAAUT  AAANHANG  ATRACTOR    AWERKT     ABROM    ALEVEN  APERSONG 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##   AGEZONG   AWAOREG    ABRAND   AZEILPL  APLEZIER    AFIETS   AINBOED 
## "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
##  ABYSTAND targetVar 
## "integer"  "factor"

2.a.iv) Statistical summary of all attributes.

summary(xy_train)
##     MOSTYPE         MAANTHUI         MGEMOMV         MGEMLEEF    
##  Min.   : 1.00   Min.   : 1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:10.00   1st Qu.: 1.000   1st Qu.:2.000   1st Qu.:2.000  
##  Median :30.00   Median : 1.000   Median :3.000   Median :3.000  
##  Mean   :24.25   Mean   : 1.111   Mean   :2.679   Mean   :2.991  
##  3rd Qu.:35.00   3rd Qu.: 1.000   3rd Qu.:3.000   3rd Qu.:3.000  
##  Max.   :41.00   Max.   :10.000   Max.   :5.000   Max.   :6.000  
##     MOSHOOFD          MGODRK           MGODPR          MGODOV    
##  Min.   : 1.000   Min.   :0.0000   Min.   :0.000   Min.   :0.00  
##  1st Qu.: 3.000   1st Qu.:0.0000   1st Qu.:4.000   1st Qu.:0.00  
##  Median : 7.000   Median :0.0000   Median :5.000   Median :1.00  
##  Mean   : 5.774   Mean   :0.6965   Mean   :4.627   Mean   :1.07  
##  3rd Qu.: 8.000   3rd Qu.:1.0000   3rd Qu.:6.000   3rd Qu.:2.00  
##  Max.   :10.000   Max.   :9.0000   Max.   :9.000   Max.   :5.00  
##      MGODGE          MRELGE          MRELSA           MRELOV    
##  Min.   :0.000   Min.   :0.000   Min.   :0.0000   Min.   :0.00  
##  1st Qu.:2.000   1st Qu.:5.000   1st Qu.:0.0000   1st Qu.:1.00  
##  Median :3.000   Median :6.000   Median :1.0000   Median :2.00  
##  Mean   :3.259   Mean   :6.183   Mean   :0.8835   Mean   :2.29  
##  3rd Qu.:4.000   3rd Qu.:7.000   3rd Qu.:1.0000   3rd Qu.:3.00  
##  Max.   :9.000   Max.   :9.000   Max.   :7.0000   Max.   :9.00  
##     MFALLEEN        MFGEKIND       MFWEKIND      MOPLHOOG    
##  Min.   :0.000   Min.   :0.00   Min.   :0.0   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:2.00   1st Qu.:3.0   1st Qu.:0.000  
##  Median :2.000   Median :3.00   Median :4.0   Median :1.000  
##  Mean   :1.888   Mean   :3.23   Mean   :4.3   Mean   :1.461  
##  3rd Qu.:3.000   3rd Qu.:4.00   3rd Qu.:6.0   3rd Qu.:2.000  
##  Max.   :9.000   Max.   :9.00   Max.   :9.0   Max.   :9.000  
##     MOPLMIDD        MOPLLAAG        MBERHOOG        MBERZELF    
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:2.000   1st Qu.:3.000   1st Qu.:0.000   1st Qu.:0.000  
##  Median :3.000   Median :5.000   Median :2.000   Median :0.000  
##  Mean   :3.351   Mean   :4.572   Mean   :1.895   Mean   :0.398  
##  3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.:3.000   3rd Qu.:1.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :5.000  
##     MBERBOER         MBERMIDD        MBERARBG       MBERARBO    
##  Min.   :0.0000   Min.   :0.000   Min.   :0.00   Min.   :0.000  
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.00   1st Qu.:1.000  
##  Median :0.0000   Median :3.000   Median :2.00   Median :2.000  
##  Mean   :0.5223   Mean   :2.899   Mean   :2.22   Mean   :2.306  
##  3rd Qu.:1.0000   3rd Qu.:4.000   3rd Qu.:3.00   3rd Qu.:3.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :9.00   Max.   :9.000  
##       MSKA           MSKB1           MSKB2            MSKC      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:2.000  
##  Median :1.000   Median :2.000   Median :2.000   Median :4.000  
##  Mean   :1.621   Mean   :1.607   Mean   :2.203   Mean   :3.759  
##  3rd Qu.:2.000   3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:5.000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##       MSKD           MHHUUR          MHKOOP          MAUT1     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.00  
##  1st Qu.:0.000   1st Qu.:2.000   1st Qu.:2.000   1st Qu.:5.00  
##  Median :1.000   Median :4.000   Median :5.000   Median :6.00  
##  Mean   :1.067   Mean   :4.237   Mean   :4.772   Mean   :6.04  
##  3rd Qu.:2.000   3rd Qu.:7.000   3rd Qu.:7.000   3rd Qu.:7.00  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.00  
##      MAUT2           MAUT0          MZFONDS          MZPART     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.000   1st Qu.:1.000   1st Qu.:5.000   1st Qu.:1.000  
##  Median :1.000   Median :2.000   Median :7.000   Median :2.000  
##  Mean   :1.316   Mean   :1.959   Mean   :6.277   Mean   :2.729  
##  3rd Qu.:2.000   3rd Qu.:3.000   3rd Qu.:8.000   3rd Qu.:4.000  
##  Max.   :7.000   Max.   :9.000   Max.   :9.000   Max.   :9.000  
##     MINKM30         MINK3045        MINK4575        MINK7512     
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   1st Qu.:0.0000  
##  Median :2.000   Median :4.000   Median :3.000   Median :0.0000  
##  Mean   :2.574   Mean   :3.536   Mean   :2.731   Mean   :0.7961  
##  3rd Qu.:4.000   3rd Qu.:5.000   3rd Qu.:4.000   3rd Qu.:1.0000  
##  Max.   :9.000   Max.   :9.000   Max.   :9.000   Max.   :9.0000  
##     MINK123M         MINKGEM         MKOOPKLA        PWAPART      
##  Min.   :0.0000   Min.   :0.000   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:3.000   1st Qu.:3.000   1st Qu.:0.0000  
##  Median :0.0000   Median :4.000   Median :4.000   Median :0.0000  
##  Mean   :0.2027   Mean   :3.784   Mean   :4.236   Mean   :0.7712  
##  3rd Qu.:0.0000   3rd Qu.:4.000   3rd Qu.:6.000   3rd Qu.:2.0000  
##  Max.   :9.0000   Max.   :9.000   Max.   :8.000   Max.   :3.0000  
##     PWABEDR           PWALAND           PPERSAUT       PBESAUT       
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :5.00   Median :0.00000  
##  Mean   :0.04002   Mean   :0.07162   Mean   :2.97   Mean   :0.04827  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:6.00   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :4.00000   Max.   :8.00   Max.   :7.00000  
##     PMOTSCO          PVRAAUT            PAANHANG          PTRACTOR      
##  Min.   :0.0000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.0000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.0000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.1754   Mean   :0.009447   Mean   :0.02096   Mean   :0.09258  
##  3rd Qu.:0.0000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :7.0000   Max.   :9.000000   Max.   :5.00000   Max.   :6.00000  
##      PWERKT            PBROM           PLEVEN          PPERSONG      
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000   Median :0.0000   Median :0.00000  
##  Mean   :0.01305   Mean   :0.215   Mean   :0.1948   Mean   :0.01374  
##  3rd Qu.:0.00000   3rd Qu.:0.000   3rd Qu.:0.0000   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :6.000   Max.   :9.0000   Max.   :6.00000  
##     PGEZONG           PWAOREG            PBRAND         PZEILPL         
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000   Min.   :0.0000000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000   1st Qu.:0.0000000  
##  Median :0.00000   Median :0.00000   Median :2.000   Median :0.0000000  
##  Mean   :0.01529   Mean   :0.02353   Mean   :1.828   Mean   :0.0008588  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:4.000   3rd Qu.:0.0000000  
##  Max.   :3.00000   Max.   :7.00000   Max.   :8.000   Max.   :3.0000000  
##     PPLEZIER           PFIETS           PINBOED           PBYSTAND      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.00000   Median :0.00000  
##  Mean   :0.01889   Mean   :0.02525   Mean   :0.01563   Mean   :0.04758  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :6.00000   Max.   :1.00000   Max.   :6.00000   Max.   :5.00000  
##     AWAPART         AWABEDR           AWALAND           APERSAUT     
##  Min.   :0.000   Min.   :0.00000   Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.000   1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.000   Median :0.00000   Median :0.00000   Median :1.0000  
##  Mean   :0.403   Mean   :0.01477   Mean   :0.02061   Mean   :0.5622  
##  3rd Qu.:1.000   3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:1.0000  
##  Max.   :2.000   Max.   :5.00000   Max.   :1.00000   Max.   :7.0000  
##     ABESAUT           AMOTSCO           AVRAAUT            AAANHANG      
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.00000   Median :0.000000   Median :0.00000  
##  Mean   :0.01048   Mean   :0.04105   Mean   :0.002233   Mean   :0.01254  
##  3rd Qu.:0.00000   3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :4.00000   Max.   :8.00000   Max.   :3.000000   Max.   :3.00000  
##     ATRACTOR           AWERKT             ABROM             ALEVEN       
##  Min.   :0.00000   Min.   :0.000000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.00000   1st Qu.:0.000000   1st Qu.:0.00000   1st Qu.:0.00000  
##  Median :0.00000   Median :0.000000   Median :0.00000   Median :0.00000  
##  Mean   :0.03367   Mean   :0.006183   Mean   :0.07042   Mean   :0.07661  
##  3rd Qu.:0.00000   3rd Qu.:0.000000   3rd Qu.:0.00000   3rd Qu.:0.00000  
##  Max.   :4.00000   Max.   :6.000000   Max.   :2.00000   Max.   :8.00000  
##     APERSONG           AGEZONG            AWAOREG             ABRAND      
##  Min.   :0.000000   Min.   :0.000000   Min.   :0.000000   Min.   :0.0000  
##  1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.000000   1st Qu.:0.0000  
##  Median :0.000000   Median :0.000000   Median :0.000000   Median :1.0000  
##  Mean   :0.005325   Mean   :0.006527   Mean   :0.004638   Mean   :0.5701  
##  3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:0.000000   3rd Qu.:1.0000  
##  Max.   :1.000000   Max.   :1.000000   Max.   :2.000000   Max.   :7.0000  
##     AZEILPL             APLEZIER            AFIETS       
##  Min.   :0.0000000   Min.   :0.000000   Min.   :0.00000  
##  1st Qu.:0.0000000   1st Qu.:0.000000   1st Qu.:0.00000  
##  Median :0.0000000   Median :0.000000   Median :0.00000  
##  Mean   :0.0005153   Mean   :0.006012   Mean   :0.03178  
##  3rd Qu.:0.0000000   3rd Qu.:0.000000   3rd Qu.:0.00000  
##  Max.   :1.0000000   Max.   :2.000000   Max.   :3.00000  
##     AINBOED            ABYSTAND       targetVar 
##  Min.   :0.000000   Min.   :0.00000   Yes: 348  
##  1st Qu.:0.000000   1st Qu.:0.00000   No :5474  
##  Median :0.000000   Median :0.00000             
##  Mean   :0.007901   Mean   :0.01426             
##  3rd Qu.:0.000000   3rd Qu.:0.00000             
##  Max.   :2.000000   Max.   :2.00000

2.a.v) Summarize the levels of the class attribute.

#entireDataset_x <- entireDataset[,1:(totCol-1)]
#entireDataset_y <- entireDataset[,totCol]
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
##     freq percentage
## Yes  348   5.977327
## No  5474  94.022673
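
With roughly 6% positives, a model that always predicts `No` already looks accurate, which is why accuracy is a poor guide for this dataset. The arithmetic, using the class counts shown above:

```r
n_yes <- 348    # policy holders in the training data
n_no  <- 5474   # non-holders in the training data

# Accuracy of a trivial classifier that predicts "No" for every customer
baseline_accuracy <- n_no / (n_yes + n_no)
print(round(baseline_accuracy, 4))

# Its sensitivity for the "Yes" class is zero: no policy holder is ever found
baseline_sensitivity <- 0 / n_yes
print(baseline_sensitivity)
```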

2.a.vi) Count missing values.

sapply(xy_train, function(x) sum(is.na(x)))
##   MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR 
##         0         0         0         0         0         0         0 
##    MGODOV    MGODGE    MRELGE    MRELSA    MRELOV  MFALLEEN  MFGEKIND 
##         0         0         0         0         0         0         0 
##  MFWEKIND  MOPLHOOG  MOPLMIDD  MOPLLAAG  MBERHOOG  MBERZELF  MBERBOER 
##         0         0         0         0         0         0         0 
##  MBERMIDD  MBERARBG  MBERARBO      MSKA     MSKB1     MSKB2      MSKC 
##         0         0         0         0         0         0         0 
##      MSKD    MHHUUR    MHKOOP     MAUT1     MAUT2     MAUT0   MZFONDS 
##         0         0         0         0         0         0         0 
##    MZPART   MINKM30  MINK3045  MINK4575  MINK7512  MINK123M   MINKGEM 
##         0         0         0         0         0         0         0 
##  MKOOPKLA   PWAPART   PWABEDR   PWALAND  PPERSAUT   PBESAUT   PMOTSCO 
##         0         0         0         0         0         0         0 
##   PVRAAUT  PAANHANG  PTRACTOR    PWERKT     PBROM    PLEVEN  PPERSONG 
##         0         0         0         0         0         0         0 
##   PGEZONG   PWAOREG    PBRAND   PZEILPL  PPLEZIER    PFIETS   PINBOED 
##         0         0         0         0         0         0         0 
##  PBYSTAND   AWAPART   AWABEDR   AWALAND  APERSAUT   ABESAUT   AMOTSCO 
##         0         0         0         0         0         0         0 
##   AVRAAUT  AAANHANG  ATRACTOR    AWERKT     ABROM    ALEVEN  APERSONG 
##         0         0         0         0         0         0         0 
##   AGEZONG   AWAOREG    ABRAND   AZEILPL  APLEZIER    AFIETS   AINBOED 
##         0         0         0         0         0         0         0 
##  ABYSTAND targetVar 
##         0         0
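
When only the totals matter, the per-column listing can be condensed. The following sketch runs on a small made-up data frame (`toy` is a hypothetical name) and uses base R only:

```r
# Small made-up data frame with two missing values
toy <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA), c = 1:3)

# Missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Single yes/no answer for the whole data frame
has_missing <- any(is.na(toy))
print(has_missing)
```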

2.b) Data visualizations

2.b.i) Univariate plots to better understand each attribute.

# Boxplots for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    boxplot(x_train[,i], main=names(x_train)[i])
}

# Histograms for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    hist(x_train[,i], main=names(x_train)[i])
}

# Density plot for each attribute
# par(mfrow=c(dispRow,dispCol))
for(i in 1:totAttr) {
    plot(density(x_train[,i]), main=names(x_train)[i])
}

2.b.ii) Multivariate plots to better understand the relationships between attributes

# Scatterplot matrix colored by class
# pairs(targetVar~., data=xy_train, col=xy_train$targetVar)
# Box and whisker plots for each attribute by class
scales <- list(x=list(relation="free"), y=list(relation="free"))
featurePlot(x=x_train, y=y_train, plot="box", scales=scales)

# Density plots for each attribute by class value
featurePlot(x=x_train, y=y_train, plot="density", scales=scales)

# Correlation plot
correlations <- cor(x_train)
corrplot(correlations, method="circle")
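
A circle plot of an 85-by-85 correlation matrix can be hard to read, so it may help to list the strongest pairs explicitly. The sketch below does this on synthetic data with base R. The column names, the 0.8 cutoff, and the helper `strong_pairs` are all illustrative assumptions.

```r
# Return attribute pairs whose absolute correlation exceeds a cutoff
strong_pairs <- function(cor_matrix, cutoff = 0.8) {
  # Look only at the upper triangle so each pair is reported once
  idx <- which(abs(cor_matrix) > cutoff & upper.tri(cor_matrix), arr.ind = TRUE)
  data.frame(var1 = rownames(cor_matrix)[idx[, "row"]],
             var2 = colnames(cor_matrix)[idx[, "col"]],
             r    = cor_matrix[idx],
             stringsAsFactors = FALSE)
}

# Synthetic data: b is a noisy copy of a, c is unrelated
set.seed(888)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

pairs_found <- strong_pairs(cor(toy))
print(pairs_found)
```

Applied to the real data, the same helper could be called as `strong_pairs(correlations)` on the matrix computed above.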

email_notify(paste("Data Summarization and Visualization completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@28f67ac7}"

3. Prepare Data

Some datasets may require additional preparation activities that best expose the structure of the problem and the relationships between the input attributes and the output variable. The data-prep tasks covered in this section are data cleaning, feature selection, and data transforms.

email_notify(paste("Data Cleaning and Transformation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4ec6a292}"

3.a) Data Cleaning

# According to the data dictionary, columns MOSTYPE and MOSHOOFD should be converted to categorical type
xy_train$MOSTYPE <- as.factor(xy_train$MOSTYPE)
xy_train$MOSHOOFD <- as.factor(xy_train$MOSHOOFD)
xy_test$MOSTYPE <- as.factor(xy_test$MOSTYPE)
xy_test$MOSHOOFD <- as.factor(xy_test$MOSHOOFD)
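Calling `as.factor()` separately on the training and test sets derives the levels from whichever codes happen to appear in each set, so the two level sets can differ. Whether that happens with this dataset is not shown here. A minimal sketch of building one shared level set, using invented codes in place of the real `MOSTYPE` column:

```r
# Toy stand-ins for xy_train$MOSTYPE and xy_test$MOSTYPE (values invented).
train_codes <- c(1, 3, 8, 8, 33)
test_codes  <- c(3, 8, 19, 33)     # 19 never appears in the toy training data

# Build one shared level set, then apply it to both sides.
all_levels   <- sort(unique(c(train_codes, test_codes)))
train_factor <- factor(train_codes, levels = all_levels)
test_factor  <- factor(test_codes,  levels = all_levels)

identical(levels(train_factor), levels(test_factor))
table(train_factor)   # level 19 is declared but has zero training rows
```

A declared level with no training rows becomes an all-zero dummy column once the formula interface expands the factor.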

3.b) Feature Selection

# Not applicable for this iteration of the project.

3.c) Data Transforms

# Perform SMOTE transformation to combat the imbalance of the data
set.seed(seedNum)
xy_train <- SMOTE(targetVar ~., data=xy_train, perc.over=200, perc.under=300)
totCol <- ncol(xy_train)
y_train <- xy_train[,totCol]
cbind(freq=table(y_train), percentage=prop.table(table(y_train))*100)
##     freq percentage
## Yes 1044   33.33333
## No  2088   66.66667
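The class counts above follow directly from the two percentage arguments, assuming `SMOTE()` here is the DMwR implementation. In that implementation `perc.over=200` generates two synthetic minority cases per original minority case, and `perc.under=300` samples three majority cases per synthetic case. The helper `smote_counts` below is hypothetical, and the starting count of 348 minority cases is inferred from the output (1044 / 3) rather than read from the data.

```r
# Hypothetical helper: expected class counts after DMwR-style SMOTE.
smote_counts <- function(n_minority, perc.over, perc.under) {
  n_synthetic <- n_minority * perc.over / 100     # new minority cases generated
  n_majority  <- n_synthetic * perc.under / 100   # majority cases sampled
  c(minority = n_minority + n_synthetic, majority = n_majority)
}

# 348 is inferred from the output above, not taken from the raw data.
counts <- smote_counts(n_minority = 348, perc.over = 200, perc.under = 300)
counts
round(100 * counts / sum(counts), 5)
```

The result is a 1:2 class ratio in a 3132-row training set, which matches the dimensions reported in section 3.d.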

3.d) Display the Final Dataset for Model-Building

dim(xy_train)
## [1] 3132   86
sapply(xy_train, class)
##   MOSTYPE  MAANTHUI   MGEMOMV  MGEMLEEF  MOSHOOFD    MGODRK    MGODPR 
##  "factor" "numeric" "numeric" "numeric"  "factor" "numeric" "numeric" 
##    MGODOV    MGODGE    MRELGE    MRELSA    MRELOV  MFALLEEN  MFGEKIND 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##  MFWEKIND  MOPLHOOG  MOPLMIDD  MOPLLAAG  MBERHOOG  MBERZELF  MBERBOER 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##  MBERMIDD  MBERARBG  MBERARBO      MSKA     MSKB1     MSKB2      MSKC 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##      MSKD    MHHUUR    MHKOOP     MAUT1     MAUT2     MAUT0   MZFONDS 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##    MZPART   MINKM30  MINK3045  MINK4575  MINK7512  MINK123M   MINKGEM 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##  MKOOPKLA   PWAPART   PWABEDR   PWALAND  PPERSAUT   PBESAUT   PMOTSCO 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##   PVRAAUT  PAANHANG  PTRACTOR    PWERKT     PBROM    PLEVEN  PPERSONG 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##   PGEZONG   PWAOREG    PBRAND   PZEILPL  PPLEZIER    PFIETS   PINBOED 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##  PBYSTAND   AWAPART   AWABEDR   AWALAND  APERSAUT   ABESAUT   AMOTSCO 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##   AVRAAUT  AAANHANG  ATRACTOR    AWERKT     ABROM    ALEVEN  APERSONG 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##   AGEZONG   AWAOREG    ABRAND   AZEILPL  APLEZIER    AFIETS   AINBOED 
## "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" 
##  ABYSTAND targetVar 
## "numeric"  "factor"
email_notify(paste("Data Cleaning and Transformation completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@71c7db30}"
proc.time()-startTimeScript
##    user  system elapsed 
##  58.169   0.794  67.081

4. Model and Evaluate Algorithms

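After the data-prep, we next work on finding a workable model by evaluating a subset of machine learning algorithms that are good at exploiting the structure of the training data. The evaluation tasks in this section are generating models with linear, nonlinear, and ensemble algorithms, and then comparing the baseline results.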

For this project, we will evaluate one linear, three non-linear, and three ensemble algorithms:

Linear Algorithm: Logistic Regression

Non-Linear Algorithms: Decision Trees (CART), k-Nearest Neighbors, and Support Vector Machine

Ensemble Algorithms: Bagged CART, Random Forest, and Stochastic Gradient Boosting

The random number seed is reset before each run so that every algorithm is evaluated on the same data splits, which makes the results directly comparable.
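The effect of resetting the seed can be illustrated with a simple fold assignment. This is a sketch only: `make_folds` is a hypothetical, unstratified stand-in for the resampling that caret performs internally, and the seed value 888 is illustrative rather than the project's `seedNum`.

```r
# Hypothetical helper that assigns each row to one of k folds at random.
make_folds <- function(n, k, seed) {
  set.seed(seed)
  sample(rep(seq_len(k), length.out = n))
}

seedNum_demo <- 888   # illustrative value; the project's seedNum is defined elsewhere
folds_a <- make_folds(n = 3132, k = 10, seed = seedNum_demo)
folds_b <- make_folds(n = 3132, k = 10, seed = seedNum_demo)
folds_c <- make_folds(n = 3132, k = 10, seed = seedNum_demo + 1)

identical(folds_a, folds_b)   # same seed, same assignment
identical(folds_a, folds_c)   # different seed, almost surely a different assignment
table(folds_a)                # fold sizes of 313 or 314
```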

4.a) Generate models using linear algorithms

# Logistic Regression (Classification)
email_notify(paste("Logistic Regression modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@45c8e616}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.glm <- train(targetVar~., data=xy_train, method="glm", metric=metricTarget, trControl=control)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
print(fit.glm)
## Generalized Linear Model 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results:
## 
##   ROC        Sens       Spec     
##   0.8672422  0.6532784  0.8941595
proc.time()-startTimeModule
##    user  system elapsed 
##   8.736   0.067   8.903
email_notify(paste("Logistic Regression modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3f0ee7cb}"

4.b) Generate models using nonlinear algorithms

# Decision Tree - CART (Regression/Classification)
email_notify(paste("Decision Tree modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@5a39699c}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.cart <- train(targetVar~., data=xy_train, method="rpart", metric=metricTarget, trControl=control)
print(fit.cart)
## CART 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results across tuning parameters:
## 
##   cp          ROC        Sens       Spec     
##   0.03799489  0.8070917  0.5391117  0.9540187
##   0.04022989  0.7950223  0.5036905  0.9573772
##   0.17385057  0.6979838  0.2740110  1.0000000
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03799489.
proc.time()-startTimeModule
##    user  system elapsed 
##   3.458   0.002   3.500
email_notify(paste("Decision Tree modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@1b0375b3}"
# k-Nearest Neighbors (Regression/Classification)
email_notify(paste("k-Nearest Neighbors modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@e580929}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.knn <- train(targetVar~., data=xy_train, method="knn", metric=metricTarget, trControl=control)
print(fit.knn)
## k-Nearest Neighbors 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results across tuning parameters:
## 
##   k  ROC        Sens       Spec     
##   5  0.8728679  0.7041300  0.8524775
##   7  0.8714302  0.7136996  0.8577544
##   9  0.8748909  0.7155769  0.8592013
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
proc.time()-startTimeModule
##    user  system elapsed 
##  10.527   0.003  10.645
email_notify(paste("k-Nearest Neighbors modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3ac3fd8b}"
# Support Vector Machine (Regression/Classification)
email_notify(paste("Support Vector Machine modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7a79be86}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.svm <- train(targetVar~., data=xy_train, method="svmRadial", metric=metricTarget, trControl=control)
## Warning in .local(x, ...): Variable(s) `' constant. Cannot scale data.
print(fit.svm)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results across tuning parameters:
## 
##   C     ROC        Sens       Spec     
##   0.25  0.8903108  0.6647253  0.9329476
##   0.50  0.9010262  0.6838919  0.9343923
##   1.00  0.9110365  0.7002015  0.9391861
## 
## Tuning parameter 'sigma' was held constant at a value of 0.005284753
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.005284753 and C = 1.
proc.time()-startTimeModule
##    user  system elapsed 
##  89.754   0.253  91.018
email_notify(paste("Support Vector Machine modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@57829d67}"
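The "constant" warnings above typically point to a predictor column that holds a single value in the data being fitted, which can happen when a factor level has no rows. A minimal sketch of locating such columns after dummy expansion, using an invented data frame `toy` rather than `xy_train`:

```r
# Toy data: a factor with a declared level ("19") that no row actually uses.
toy <- data.frame(
  MOSTYPE  = factor(c("1", "3", "8", "3", "1", "8"), levels = c("1", "3", "8", "19")),
  MAANTHUI = c(1, 2, 1, 1, 3, 1)
)

# Expand factors to dummy columns the way a formula interface would.
mm <- model.matrix(~ ., data = toy)[, -1, drop = FALSE]   # drop the intercept

# A column is constant when it holds a single distinct value.
is_constant <- apply(mm, 2, function(col) length(unique(col)) == 1)
names(which(is_constant))
```

caret's `nearZeroVar()` offers a related check that also flags columns that are almost constant.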

4.c) Generate models using ensemble algorithms

In this section, we will explore the use and tuning of ensemble algorithms to see whether we can improve the results.

# Bagged CART (Regression/Classification)
email_notify(paste("Bagged CART modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@7a0ac6e3}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.bagcart <- train(targetVar~., data=xy_train, method="treebag", metric=metricTarget, trControl=control)
print(fit.bagcart)
## Bagged CART 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results:
## 
##   ROC       Sens      Spec     
##   0.911936  0.732793  0.9478124
proc.time()-startTimeModule
##    user  system elapsed 
##  46.154   0.257  46.972
email_notify(paste("Bagged CART modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@3498ed}"
# Random Forest (Regression/Classification)
email_notify(paste("Random Forest modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@668bc3d5}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.rf <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, trControl=control)
print(fit.rf)
## Random Forest 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##     2   0.9188756  0.6302381  0.9961699
##    66   0.9222141  0.7356410  0.9554679
##   131   0.9219312  0.7318040  0.9549894
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 66.
proc.time()-startTimeModule
##    user  system elapsed 
## 519.129   0.783 526.046
email_notify(paste("Random Forest modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@79698539}"
# Stochastic Gradient Boosting (Regression/Classification)
email_notify(paste("Stochastic Gradient Boosting modeling has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@26ba2a48}"
startTimeModule <- proc.time()
set.seed(seedNum)
fit.gbm <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, trControl=control, verbose=F)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 17: MOSTYPE19 has no variation.
print(fit.gbm)
## Stochastic Gradient Boosting 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  ROC        Sens       Spec     
##   1                   50      0.8839905  0.6206502  0.9602549
##   1                  100      0.8983872  0.6733333  0.9568987
##   1                  150      0.9027951  0.6838828  0.9607310
##   2                   50      0.9060061  0.6542308  0.9861083
##   2                  100      0.9134373  0.6887271  0.9760536
##   2                  150      0.9164903  0.6973352  0.9693481
##   3                   50      0.9124517  0.6829853  0.9803690
##   3                  100      0.9211726  0.7069597  0.9736612
##   3                  150      0.9215802  0.7213004  0.9669603
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
##  46.595   0.065  47.196
email_notify(paste("Stochastic Gradient Boosting modeling completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@246ae04d}"

4.d) Compare baseline algorithms

results <- resamples(list(LR=fit.glm, CART=fit.cart, kNN=fit.knn, SVM=fit.svm, BagCART=fit.bagcart, RF=fit.rf, GBM=fit.gbm))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: LR, CART, kNN, SVM, BagCART, RF, GBM 
## Number of resamples: 10 
## 
## ROC 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.8303736 0.8490427 0.8699678 0.8672422 0.8824893 0.8976054    0
## CART    0.7422479 0.7832203 0.8174457 0.8070917 0.8287928 0.8543860    0
## kNN     0.8395306 0.8685499 0.8730918 0.8748909 0.8845115 0.8966001    0
## SVM     0.8840510 0.9039884 0.9115892 0.9110365 0.9194293 0.9339805    0
## BagCART 0.8893825 0.9051435 0.9127281 0.9119360 0.9208966 0.9298629    0
## RF      0.9059011 0.9184097 0.9233760 0.9222141 0.9296881 0.9322994    0
## GBM     0.9106416 0.9123091 0.9223868 0.9215802 0.9284008 0.9347004    0
## 
## Sens 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.6000000 0.6253434 0.6554945 0.6532784 0.6778846 0.7142857    0
## CART    0.3750000 0.4440247 0.5548077 0.5391117 0.6467720 0.6538462    0
## kNN     0.6476190 0.6778159 0.7272894 0.7155769 0.7493819 0.7692308    0
## SVM     0.6476190 0.6826923 0.6971154 0.7002015 0.7254808 0.7500000    0
## BagCART 0.6571429 0.7026328 0.7403846 0.7327930 0.7673993 0.7788462    0
## RF      0.6952381 0.7115385 0.7355769 0.7356410 0.7589286 0.7788462    0
## GBM     0.6666667 0.7026328 0.7224359 0.7213004 0.7482143 0.7692308    0
## 
## Spec 
##              Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## LR      0.8365385 0.8720096 0.8947368 0.8941595 0.9174641 0.9567308    0
## CART    0.8708134 0.9231517 0.9808152 0.9540187 0.9892344 1.0000000    0
## kNN     0.8325359 0.8433014 0.8561028 0.8592013 0.8775591 0.8899522    0
## SVM     0.9090909 0.9327728 0.9401914 0.9391861 0.9497608 0.9663462    0
## BagCART 0.9234450 0.9425837 0.9473684 0.9478124 0.9591921 0.9663462    0
## RF      0.9234450 0.9449186 0.9569378 0.9554679 0.9653110 0.9760766    0
## GBM     0.9425837 0.9557014 0.9712919 0.9669603 0.9760766 0.9856459    0
dotplot(results)

cat('The average ROC from all models is:', mean(c(results$values$`LR~ROC`, results$values$`CART~ROC`, results$values$`kNN~ROC`, results$values$`SVM~ROC`, results$values$`BagCART~ROC`, results$values$`RF~ROC`, results$values$`GBM~ROC`)))
## The average ROC from all models is: 0.8879988
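The average above is built by naming each `~ROC` column by hand. A more compact variant selects those columns by pattern. The sketch below runs on an invented data frame `toy_values` whose column naming mimics `results$values`; the numbers are made up.

```r
# Toy stand-in for results$values: one row per resample, one column per
# model/metric pair named "<model>~<metric>" (numbers are invented).
toy_values <- data.frame(
  Resample    = c("Fold01", "Fold02"),
  `LR~ROC`    = c(0.80, 0.90),
  `LR~Sens`   = c(0.60, 0.70),
  `CART~ROC`  = c(0.70, 0.80),
  `CART~Sens` = c(0.50, 0.60),
  check.names = FALSE
)

# Select every column whose name ends in "~ROC" and average all their cells.
roc_cols <- grep("~ROC$", names(toy_values), value = TRUE)
mean_roc <- mean(unlist(toy_values[roc_cols]))
cat("The average ROC from all models is:", mean_roc, "\n")
```

This version keeps working when models are added to or removed from the `resamples()` list.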

5. Improve Accuracy or Results

After we achieve a short list of machine learning algorithms with a good level of accuracy, we can look for ways to improve the accuracy of the models.

Using the two best-performing algorithms from the previous section, Random Forest and Stochastic Gradient Boosting, we will search for a combination of parameters for each algorithm that yields the best results.

5.a) Algorithm Tuning

We tune each of the selected algorithms further to see whether we can get more accuracy out of them.

# Tuning algorithm #1 - Random Forest
email_notify(paste("Algorithm #1 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@60addb54}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(mtry = c(10, 20, 35, 50, 75))
fit.final1 <- train(targetVar~., data=xy_train, method="rf", metric=metricTarget, tuneGrid=grid, trControl=control)
plot(fit.final1)

print(fit.final1)
## Random Forest 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results across tuning parameters:
## 
##   mtry  ROC        Sens       Spec     
##   10    0.9226155  0.7193773  0.9664796
##   20    0.9243642  0.7318498  0.9616926
##   35    0.9237356  0.7346978  0.9593002
##   50    0.9233970  0.7356502  0.9578671
##   75    0.9211255  0.7346795  0.9540348
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 20.
proc.time()-startTimeModule
##    user  system elapsed 
## 739.387   1.535 749.824
email_notify(paste("Algorithm #1 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@90f6bfd}"
# Tuning algorithm #2 - Stochastic Gradient Boosting
email_notify(paste("Algorithm #2 tuning has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@4909b8da}"
startTimeModule <- proc.time()
set.seed(seedNum)
grid <- expand.grid(n.trees = c(100, 200, 300, 400), interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10)
fit.final2 <- train(targetVar~., data=xy_train, method="gbm", metric=metricTarget, tuneGrid=grid, trControl=control, verbose=F)
## Warning in (function (x, y, offset = NULL, misc = NULL, distribution =
## "bernoulli", : variable 17: MOSTYPE19 has no variation.
plot(fit.final2)

print(fit.final2)
## Stochastic Gradient Boosting 
## 
## 3132 samples
##   85 predictor
##    2 classes: 'Yes', 'No' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 1 times) 
## Summary of sample sizes: 2819, 2820, 2818, 2819, 2818, 2819, ... 
## Resampling results across tuning parameters:
## 
##   n.trees  ROC        Sens       Spec     
##   100      0.9195662  0.7088004  0.9688673
##   200      0.9203971  0.7155403  0.9655180
##   300      0.9185297  0.7250824  0.9636111
##   400      0.9177711  0.7260714  0.9597764
## 
## Tuning parameter 'interaction.depth' was held constant at a value of
##  3
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 200,
##  interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
proc.time()-startTimeModule
##    user  system elapsed 
##  56.008   0.062  56.709
email_notify(paste("Algorithm #2 tuning completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@54a097cc}"
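The tuning grid for Stochastic Gradient Boosting varies only `n.trees` and holds the other three parameters fixed. As a sketch of how `expand.grid()` behaves when more than one parameter varies, the example below adds a second value for `interaction.depth`; the candidate values are illustrative and not recommendations from this project.

```r
# Candidate values are illustrative, not taken from the project's tuning.
grid_demo <- expand.grid(
  n.trees = c(100, 200, 300, 400),
  interaction.depth = c(2, 3),
  shrinkage = 0.1,
  n.minobsinnode = 10
)
nrow(grid_demo)      # one row per combination: 4 x 2 x 1 x 1
head(grid_demo, 3)
```

Each row of the grid is fitted once per resample, so the grid size multiplies the training time directly.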

5.d) Compare Algorithms After Tuning

results <- resamples(list(RF=fit.final1, GBM=fit.final2))
summary(results)
## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 10 
## 
## ROC 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9043062 0.9153167 0.9284367 0.9243642 0.9316074 0.9375693    0
## GBM 0.9149009 0.9175709 0.9202309 0.9203971 0.9230207 0.9275610    0
## 
## Sens 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.6857143 0.7160027 0.7259615 0.7318498 0.7475962 0.7788462    0
## GBM 0.6634615 0.6858288 0.7176282 0.7155403 0.7458333 0.7692308    0
## 
## Spec 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9377990 0.9497090 0.9617225 0.9616926 0.9748804 0.9807692    0
## GBM 0.9519231 0.9533493 0.9665072 0.9655180 0.9760766 0.9807692    0
dotplot(results)

6. Finalize Model and Present Results

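Once we have narrowed down to a model that we believe can make accurate predictions on unseen data, we are ready to finalize it. The sub-tasks in this section are making predictions on the validation dataset, creating a standalone model on the entire training dataset, and saving the model for later use.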

email_notify(paste("Model Validation and Final Model Creation has begun!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@735b478}"

6.a) Predictions on validation dataset

predictions <- predict(fit.final1, newdata=xy_test)
confusionMatrix(predictions, y_test)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  Yes   No
##        Yes   48  197
##        No   190 3565
##                                           
##                Accuracy : 0.9032          
##                  95% CI : (0.8937, 0.9122)
##     No Information Rate : 0.9405          
##     P-Value [Acc > NIR] : 1.0000          
##                                           
##                   Kappa : 0.1473          
##  Mcnemar's Test P-Value : 0.7604          
##                                           
##             Sensitivity : 0.20168         
##             Specificity : 0.94763         
##          Pos Pred Value : 0.19592         
##          Neg Pred Value : 0.94940         
##              Prevalence : 0.05950         
##          Detection Rate : 0.01200         
##    Detection Prevalence : 0.06125         
##       Balanced Accuracy : 0.57466         
##                                           
##        'Positive' Class : Yes             
## 
pred <- prediction(as.numeric(predictions), as.numeric(y_test))
perf <- performance(pred, measure = "tpr", x.measure = "fpr")
plot(perf, colorize=TRUE)

auc <- performance(pred, measure = "auc")
cat('The area under the curve (AUC) value is:', auc@y.values[[1]])
## The area under the curve (AUC) value is: 0.5746575
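The AUC printed above matches the Balanced Accuracy line of the confusion matrix (0.57466). That is expected when the score passed to `prediction()` is a hard class label rather than a probability: with only two distinct score values, the ROC curve has a single interior point and its area equals the mean of sensitivity and specificity. The sketch below rebuilds the vectors from the confusion matrix counts and checks this with a rank-based AUC; `auc_rank` is a hypothetical helper, not a ROCR function.

```r
# Rebuild label and prediction vectors from the confusion matrix above
# (1 = "Yes", 0 = "No"); only the four counts come from the document.
actual    <- c(rep(1, 48), rep(0, 197), rep(1, 190), rep(0, 3565))
predicted <- c(rep(1, 48), rep(1, 197), rep(0, 190), rep(0, 3565))

# Hypothetical helper: rank-based (Mann-Whitney) AUC, ties counted as half.
auc_rank <- function(score, label) {
  r <- rank(score)                       # average ranks for tied scores
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

sens <- sum(predicted == 1 & actual == 1) / sum(actual == 1)
spec <- sum(predicted == 0 & actual == 0) / sum(actual == 0)
c(auc = auc_rank(predicted, actual), balanced_accuracy = (sens + spec) / 2)
```

For an AUC that is comparable with the cross-validated ROC values, the scores would usually be class probabilities, which caret can return through `predict(..., type="prob")`. That variant was not run here, so no result is claimed for it.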

6.b) Create standalone model on entire training dataset

startTimeModule <- proc.time()
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
set.seed(seedNum)

# Combining the training and test datasets to form the original dataset that will be used for training the final model
# xy_train <- rbind(xy_train, xy_test)

finalModel <- randomForest(targetVar~., data=xy_train, mtry=20)
print(finalModel)
## 
## Call:
##  randomForest(formula = targetVar ~ ., data = xy_train, mtry = 20) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 20
## 
##         OOB estimate of  error rate: 11.94%
## Confusion matrix:
##     Yes   No class.error
## Yes 765  279  0.26724138
## No   95 1993  0.04549808
proc.time()-startTimeModule
##    user  system elapsed 
##  11.675   0.048  11.865
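The out-of-bag figures printed by `randomForest` can be reproduced from the confusion matrix counts. A short worked sketch, with the four counts copied from the output above:

```r
# Counts copied from the out-of-bag confusion matrix above.
oob <- matrix(c(765, 279,
                 95, 1993),
              nrow = 2, byrow = TRUE,
              dimnames = list(actual = c("Yes", "No"), predicted = c("Yes", "No")))

class_error <- 1 - diag(oob) / rowSums(oob)     # per-class misclassification rate
oob_error   <- 1 - sum(diag(oob)) / sum(oob)    # overall out-of-bag error rate

round(class_error, 8)
round(100 * oob_error, 2)
```

Note that these rates are measured on the SMOTE-balanced training data, so they are not directly comparable with the validation results in section 6.a.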

6.c) Save model for later use

#saveRDS(finalModel, "./finalModel_BinaryClass.rds")
email_notify(paste("Model Validation and Final Model Creation Completed!",date()))
## [1] "Java-Object{org.apache.commons.mail.SimpleEmail@61dc03ce}"
proc.time()-startTimeScript
##     user   system  elapsed 
## 1618.811    4.050 1680.334
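The `saveRDS()` call in section 6.c is commented out. As a sketch of the save and restore round trip it implies, the example below uses a small `lm` model on the built-in `cars` data and a temporary file; both are stand-ins chosen so the sketch runs anywhere, and the project would save `finalModel` to its own path instead.

```r
# Small stand-in model; the project would save finalModel instead.
toy_model <- lm(dist ~ speed, data = cars)

model_path <- tempfile(fileext = ".rds")   # temporary path used only for this sketch
saveRDS(toy_model, model_path)

# Reload the model and confirm that it predicts the same values.
restored <- readRDS(model_path)
new_data <- data.frame(speed = c(10, 20))
all.equal(predict(toy_model, new_data), predict(restored, new_data))

unlink(model_path)   # clean up the temporary file
```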